NumPy Arrays and Vectorization

Matrices and vectors are frequently needed for computation and are a convenient way to store and access data. A vector is most commonly written as a single column with many rows. A significant amount of work has gone into making computers very fast at matrix math. The tradeoff is commonly framed as 'more memory for faster calculation', but contemporary computers typically have enough memory to process matrices in chunks.

In NumPy, vectors and matrices are referred to as arrays: fixed-size collections of elements that all share one type (integer, floating-point number, string of characters, etc.). Underneath, NumPy arrays are implemented in C for greater efficiency.

Note that this is different from the Python list: lists are a built-in Python data type, whereas arrays are objects made available by the NumPy package.

Array restrictions:

  • You can't append things to an array (i.e. you can't make it bigger without creating an entirely new array)
  • You can only put things of the same type into an array
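
A minimal sketch of both restrictions. The variable names (small, bigger, mixed) are illustrative and not part of the original notebook:

```python
import numpy as np

small = np.array([1, 2, 3])

# np.append does not grow `small`; it returns a new, larger array
bigger = np.append(small, 4)
print(small.shape, bigger.shape)

# Mixing types does not raise an error: NumPy converts every element
# to a single common type (here, the integers are promoted to floats)
mixed = np.array([1, 2.5, 3])
print(mixed.dtype)
```

In practice the second restriction means that mixed input is silently converted rather than rejected, so it is worth checking the dtype after creating an array.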

The array is the basis of all (fast) scientific computing in Python. We need to have a solid foundation of what an array is, how to use it, and what it can do.

By the end of this file you should have seen simple examples of:

  1. Arrays are faster than lists!
  2. Create an array
  3. Different types of arrays
  4. Creating and accessing (indexing) arrays
  5. Building arrays from other arrays (appending)
  6. Operations on arrays of different sizes (broadcasting)
  7. Arrays as Python objects

Further reading:
https://docs.scipy.org/doc/numpy-dev/user/numpy-for-matlab-users.html


In [1]:
# Python imports
import numpy as np

Arrays versus lists

While both data types hold a sequence of values, arrays are stored more compactly in memory and perform significantly better than Python lists for numeric work. They also come with properties and syntax that make numeric operations more concise.


In [2]:
l = 20000
test_list = list(range(l))
test_array = np.arange(l)

print(type(test_list))
print(type(test_array))


<class 'list'>
<class 'numpy.ndarray'>

In [3]:
print(test_list[:300]) # Print the first 300 elements 
                       # (more on indexing in a bit):


[0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63, 64, 65, 66, 67, 68, 69, 70, 71, 72, 73, 74, 75, 76, 77, 78, 79, 80, 81, 82, 83, 84, 85, 86, 87, 88, 89, 90, 91, 92, 93, 94, 95, 96, 97, 98, 99, 100, 101, 102, 103, 104, 105, 106, 107, 108, 109, 110, 111, 112, 113, 114, 115, 116, 117, 118, 119, 120, 121, 122, 123, 124, 125, 126, 127, 128, 129, 130, 131, 132, 133, 134, 135, 136, 137, 138, 139, 140, 141, 142, 143, 144, 145, 146, 147, 148, 149, 150, 151, 152, 153, 154, 155, 156, 157, 158, 159, 160, 161, 162, 163, 164, 165, 166, 167, 168, 169, 170, 171, 172, 173, 174, 175, 176, 177, 178, 179, 180, 181, 182, 183, 184, 185, 186, 187, 188, 189, 190, 191, 192, 193, 194, 195, 196, 197, 198, 199, 200, 201, 202, 203, 204, 205, 206, 207, 208, 209, 210, 211, 212, 213, 214, 215, 216, 217, 218, 219, 220, 221, 222, 223, 224, 225, 226, 227, 228, 229, 230, 231, 232, 233, 234, 235, 236, 237, 238, 239, 240, 241, 242, 243, 244, 245, 246, 247, 248, 249, 250, 251, 252, 253, 254, 255, 256, 257, 258, 259, 260, 261, 262, 263, 264, 265, 266, 267, 268, 269, 270, 271, 272, 273, 274, 275, 276, 277, 278, 279, 280, 281, 282, 283, 284, 285, 286, 287, 288, 289, 290, 291, 292, 293, 294, 295, 296, 297, 298, 299]

In [4]:
print(test_array)


[    0     1     2 ..., 19997 19998 19999]

In [5]:
%timeit [np.sqrt(i) for i in test_list]


10 loops, best of 3: 19.9 ms per loop

In [6]:
%timeit np.sqrt(test_array)


10000 loops, best of 3: 49.4 µs per loop

If the statement says "10 loops, best of 3: [time]", the code was executed 10 times in a row and the average time per execution was recorded. This set of 10 runs was then repeated twice more (3 sets in total), and the time reported is the per-execution time of the fastest set.
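
Outside of a notebook, roughly the same measurement can be sketched with the standard-library timeit module. The loop and repeat counts below are hard-coded to mirror the "10 loops, best of 3" wording, and the names (values, loop_times, best_per_call) are illustrative:

```python
import timeit
import numpy as np

values = np.arange(20000)

# number=10 executions per timing loop, and the loop is repeated 3 times
loop_times = timeit.repeat(lambda: np.sqrt(values), number=10, repeat=3)

# Take the fastest loop and divide by the executions in that loop
best_per_call = min(loop_times) / 10
print("best of 3: {:.1f} µs per loop".format(best_per_call * 1e6))
```

The absolute numbers depend on the machine, so only the relative difference between the list and array versions is meaningful.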

Creating and accessing (indexing) arrays

We can create arrays from scratch:


In [7]:
test_array = np.array([[1,2,3,4], [6,7,8,9]])
print(test_array)


[[1 2 3 4]
 [6 7 8 9]]

Index arrays using square brackets, starting from zero and specifying row, column:


In [8]:
test_array[0,3]


Out[8]:
4

Much like Python variables, arrays do not need a declared type: NumPy infers the data type from the values used to create the array, unless one is specified explicitly.

All elements of a NumPy array share the same data type. To check the data type (dtype) enter:


In [9]:
test_array.dtype


Out[9]:
dtype('int64')

Different data types use different amounts of memory per element, which can affect performance for very large arrays.

The type of an array can be changed with astype, which returns a converted copy:
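
As a small illustration of the memory difference, the sketch below compares a few dtypes using the itemsize and nbytes attributes. The arrays (ints, small_ints, floats) are made up for this example:

```python
import numpy as np

ints = np.array([1, 2, 3, 4], dtype=np.int64)
small_ints = ints.astype(np.int8)   # same values, 1 byte per element
floats = np.array([1, 2.5, 3, 4])   # dtype inferred from the data

for arr in (ints, small_ints, floats):
    # itemsize is bytes per element; nbytes is the total for the data
    print(arr.dtype, arr.itemsize, arr.nbytes)
```

Smaller types save memory but can only represent a smaller range of values, so the choice depends on the data.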


In [10]:
test_array = test_array.astype('float64')
print(test_array)


[[ 1.  2.  3.  4.]
 [ 6.  7.  8.  9.]]

In [11]:
# We can create arrays of boolean values too:
bool_array = np.array([[True, True, False,True],[False,False,True,False]])
print(bool_array)


[[ True  True False  True]
 [False False  True False]]

We can replace values in an array:


In [12]:
test_array[0,3]=99 # Assign value directly
print(test_array)


[[  1.   2.   3.  99.]
 [  6.   7.   8.   9.]]

Deleting values from an array is possible, but because of the way arrays are stored in memory it usually makes more sense to keep the array structure and mark the value as missing. Often 'nan' (not a number) is used, or some otherwise nonsensical value such as 0 or -1.

Keep in mind that 'nan' only works for some types of arrays, such as floating-point arrays (integer arrays cannot hold it):


In [13]:
test_array[0,3] = np.nan
print(test_array)


[[  1.   2.   3.  nan]
 [  6.   7.   8.   9.]]
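
To see why the array type matters, the following sketch tries the same assignment on an integer array first. The names int_array and float_array are illustrative:

```python
import numpy as np

int_array = np.array([[1, 2, 3, 4], [6, 7, 8, 9]])  # integer dtype

try:
    int_array[0, 3] = np.nan   # integers have no representation for nan
except ValueError as err:
    print("Assignment failed:", err)

float_array = int_array.astype('float64')
float_array[0, 3] = np.nan     # works once the array holds floats
print(float_array)
```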

Fancy ways of indexing

Slicing Arrays:

Slicing refers to indexing more than one element of an existing array at once. Slicing is often used when parallelizing computations using arrays. Indexing is array[row, column].


In [14]:
test_array[:,1]     # Use the ':' to index along one dimension fully


Out[14]:
array([ 2.,  7.])

In [15]:
test_array[1,1:]    # Adding a colon indexes the rest of the values 
                    #    (includes the numbered index)


Out[15]:
array([ 7.,  8.,  9.])

In [16]:
test_array[1,1:-1]  # We can index relative to the first and last elements


Out[16]:
array([ 7.,  8.])

In [17]:
test_array[1,::2]   # We can specify a step (here, every second element)


Out[17]:
array([ 6.,  8.])

In [18]:
test_array[1,1::-1] # We can get pretty fancy about it
                    # Index second row, starting at the second element
                    #     and stepping backwards to the first.


Out[18]:
array([ 7.,  6.])
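
The slices above all follow the pattern start:stop:step, where any part may be left out. A compact sketch on a single row (the variable row is illustrative):

```python
import numpy as np

row = np.array([6., 7., 8., 9.])

print(row[1:3])    # start=1, stop=3 (the stop index is excluded)
print(row[::2])    # step=2: every second element
print(row[::-1])   # step=-1: the whole row reversed
print(row[1::-1])  # start at index 1 and walk back to index 0
```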

Logical Indexing

We can specify only the elements we want by using an array of True/False values:


In [19]:
test_array[bool_array] # Use our bool_array from earlier


Out[19]:
array([  1.,   2.,  nan,   8.])

Using the isnan function in numpy:


In [20]:
nans = np.isnan(test_array) 
print(nans)


[[False False False  True]
 [False False False False]]

In [21]:
test_array[nans] = 4
print(test_array)


[[ 1.  2.  3.  4.]
 [ 6.  7.  8.  9.]]
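
Boolean arrays do not have to be typed out by hand or come from isnan; any element-wise comparison produces one. A minimal sketch with illustrative names (data, mask):

```python
import numpy as np

data = np.array([[1., 2., 3., 4.], [6., 7., 8., 9.]])

mask = data > 5          # element-wise comparison gives a boolean array
print(mask)
print(data[mask])        # only the elements where mask is True

data[mask] = 0           # masks also work on the left-hand side
print(data)
```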

Building arrays from other arrays (appending)

We can build arrays from other arrays by stacking them horizontally or vertically:


In [22]:
test_array_Vstacked = np.vstack((test_array, [1,2,3,4]))
print(test_array_Vstacked)


[[ 1.  2.  3.  4.]
 [ 6.  7.  8.  9.]
 [ 1.  2.  3.  4.]]

In [23]:
test_array_Hstacked = np.hstack((test_array, test_array))
print(test_array_Hstacked)


[[ 1.  2.  3.  4.  1.  2.  3.  4.]
 [ 6.  7.  8.  9.  6.  7.  8.  9.]]

We can collapse a multi-dimensional array back down to one dimension via flatten:


In [24]:
test_array_Hstacked.flatten()


Out[24]:
array([ 1.,  2.,  3.,  4.,  1.,  2.,  3.,  4.,  6.,  7.,  8.,  9.,  6.,
        7.,  8.,  9.])

Caution: appending to NumPy arrays frequently is memory intensive. Every append allocates an entirely new chunk of memory and copies the old array into it.

It's faster to 'preallocate' an array of the final size with placeholder values and simply populate it as the computation progresses.
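
The two patterns can be sketched side by side. This assumes the final size (n) is known in advance; the names grown and filled are illustrative:

```python
import numpy as np

n = 5

# Slow pattern: each np.append call allocates and copies a new array
grown = np.array([])
for i in range(n):
    grown = np.append(grown, i ** 2)

# Preallocated pattern: allocate once, then fill in place
filled = np.zeros(n)
for i in range(n):
    filled[i] = i ** 2

print(grown)
print(filled)
```

Both loops give the same values; the difference only becomes noticeable in run time and memory use as n grows.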

Operations on arrays of different sizes (broadcasting)

NumPy automatically handles arithmetic operations between arrays of different shapes. In other words, when arrays have different (but compatible) shapes, the smaller is 'broadcast' across the larger.


In [25]:
test_array


Out[25]:
array([[ 1.,  2.,  3.,  4.],
       [ 6.,  7.,  8.,  9.]])

In [26]:
print("The broadcasted array is: ", test_array[0,:])
test_array[0,:] * test_array


The broadcasted array is:  [ 1.  2.  3.  4.]
Out[26]:
array([[  1.,   4.,   9.,  16.],
       [  6.,  14.,  24.,  36.]])

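However, if the shapes are not compatible, it won't work. Here the array being broadcast has shape (2,), which does not line up with the last dimension (4) of test_array: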


In [27]:
print("The broadcasted array is: ", test_array[:,0])
#test_array[:,0] * test_array # Uncomment the line to see that the 
                              #     dimensions don't match


The broadcasted array is:  [ 1.  6.]
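
A sketch of what the failing multiplication looks like when it is run, with the error caught so the rest of the code still executes. The names a, column and failed are illustrative:

```python
import numpy as np

a = np.array([[1., 2., 3., 4.], [6., 7., 8., 9.]])   # shape (2, 4)
column = a[:, 0]                                     # shape (2,)

failed = False
try:
    column * a   # trailing dimensions are 2 and 4: incompatible
except ValueError as err:
    failed = True
    print("Broadcast failed:", err)
```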

In [28]:
# Make use of the matrix transpose (also can use array.T)
np.transpose( test_array[:,0]*np.transpose(test_array) )


Out[28]:
array([[  1.,   2.,   3.,   4.],
       [ 36.,  42.,  48.,  54.]])
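
As an alternative to transposing twice, the first column can be given an explicit second dimension so that it broadcasts across the rows directly. This is one possible approach, sketched with np.newaxis:

```python
import numpy as np

a = np.array([[1., 2., 3., 4.], [6., 7., 8., 9.]])

column = a[:, 0][:, np.newaxis]   # shape (2,) becomes (2, 1)
result = column * a               # (2, 1) broadcasts against (2, 4)
print(result)
```

The result matches the double-transpose version above: each row is scaled by its own first element.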

Arrays as Python objects

Python can be used as an object-oriented language, and NumPy arrays have many attributes and methods. Many operations are available both as numpy.<function>(<array>) and as <array>.<function>().

For example, the transpose above:


In [29]:
print("The original array is: ", test_array)
print("The transposed array is: ", np.transpose(test_array) )

# Alternatively, using test_array as an object:
print("The transposed array is: ", test_array.transpose() )


The original array is:  [[ 1.  2.  3.  4.]
 [ 6.  7.  8.  9.]]
The transposed array is:  [[ 1.  6.]
 [ 2.  7.]
 [ 3.  8.]
 [ 4.  9.]]
The transposed array is:  [[ 1.  6.]
 [ 2.  7.]
 [ 3.  8.]
 [ 4.  9.]]

One of the most frequently used properties of arrays is the shape (the size along each dimension):


In [30]:
print("The original array dimensions are: ", test_array.shape)
print("The array transpose dimensions are: ", test_array.transpose().shape)


The original array dimensions are:  (2, 4)
The array transpose dimensions are:  (4, 2)

Sorting:

The sort method of an array works in-place: once it is called, the original array itself is sorted and the method returns None:


In [31]:
test_array2 = np.array([1,5,4,0,1])
print("The original array is: ", test_array2)

test_array3 = test_array2.sort() # Run the sort - note that sort() returns None
print("The reassigned array should be sorted: ", test_array3)
print("test_array2 after sort: ", test_array2)


The original array is:  [1 5 4 0 1]
The reassigned array should be sorted:  None
test_array2 after sort:  [0 1 1 4 5]
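
If the original order needs to be kept, the function form np.sort returns a sorted copy instead of modifying the array. A minimal sketch with illustrative names (unsorted, sorted_copy):

```python
import numpy as np

unsorted = np.array([1, 5, 4, 0, 1])

sorted_copy = np.sort(unsorted)   # returns a new, sorted array
print(sorted_copy)
print(unsorted)                   # the original order is preserved
```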